Fix for AveragePool performance regression by guarding ceil_mode clamp#27331
Fix for AveragePool performance regression by guarding ceil_mode clamp#27331tengli-alaska wants to merge 4 commits intomicrosoft:mainfrom
Conversation
|
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Pull request overview
This PR addresses a performance regression in the AveragePool operator introduced in PR #16752. The regression (17-28% kernel slowdown) was caused by unconditional std::min clamping operations in the hot pooling loops, which prevented compiler auto-vectorization. The fix guards these clamps behind ceil_mode checks, restoring performance for the default ceil_mode=0 path while preserving correctness for ceil_mode=1.
Changes:
- Added
ceil_modefield toAveragePool1DTask,AveragePool2DTask, andAveragePool3DTaskstructs - Conditionally guards all
std::minclamps withif (ceil_mode)checks - Updated task initializers in
pool.ccto passpool_attrs_.ceil_mode
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/core/providers/cpu/nn/pool_functors.h | Added ceil_mode field and conditional clamp guards for 1D, 2D, and 3D average pooling tasks |
| onnxruntime/core/providers/cpu/nn/pool.cc | Updated task initializations to pass ceil_mode parameter |
| .vscode/settings.json | Unrelated formatting changes and addition of C_Cpp_Runner.msvcBatchPath configuration |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/azp run Linux QNN CI Pipeline |
|
No pipelines are associated with this pull request. |
|
/azp run Windows GPU Doc Gen CI Pipeline |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline |
|
Azure Pipelines successfully started running 3 pipeline(s). |
titaiwangms
left a comment
There was a problem hiding this comment.
I also only discovered non-significant changes.
{
"model": "/home/titaiwang/onnxruntime/profiling_avgpool/averagepool_pool_11_18_average_pool_3d_valid_input/averagepool_pool_11_18_average_pool_3d_valid_input.onnx",
"operator": "AveragePool",
"old_version": "1.20.0",
"new_version": "1.21.0",
"summary": {
"kernel_old_ms": 0.191,
"kernel_new_ms": 0.198,
"kernel_change_ms": 0.007000000000000006,
"kernel_change_pct": 3.6649214659685896,
"total_old_ms": 0.201,
"total_new_ms": 0.204,
"is_regression": false
}
}
Description
Fixes #27190
Guards the
std::minbounds clamping inAveragePool1DTask,AveragePool2DTask, andAveragePool3DTaskbehind aceil_modecheck to restore performance for the defaultceil_mode=0path.Root Cause
PR #16752 added
std::minclamping onhend/wend/dendinside the hot pooling loops to fix correctness whenceil_mode=1. However, these clamps run unconditionally, including whenceil_mode=0(the default for the vast majority of models). This introduces a data-dependent loop bound that can prevent compiler auto-vectorization of the inner accumulation loop, causing a +17% to +28% kernel slowdown as reported in #27190.Fix
When
ceil_mode=0, the output dimensions are computed with floor division, which guarantees thathstart + kernel_size <= height + pad— the clamp is a no-op. This PR guards each clamp behindif (ceil_mode), restoring the original fast path while preserving the correctness fix forceil_mode=1.Changes
pool_functors.h: Addedint64_t ceil_modefield toAveragePool1DTask,AveragePool2DTask,AveragePool3DTaskand wrapped eachstd::minclamp inif (ceil_mode)pool.cc: Passpool_attrs_.ceil_modeto the three AveragePool task initializersBenchmark Results
Regression confirmed on AMD EPYC 7543 (x86_64, Linux) using ORT's built-in profiling on
averagepool_pool_11_18_average_pool_3d_valid_input:The original issue observed +17% to +28% regression. Unable to build ORT from source locally to benchmark the patched version, would appreciate if @junghyunpark2001 could validate the fix on their setup.
Notes
ceil_mode=0(default path)ceil_mode=1correctness fix from Align AvgPool ceil_mode on last value to torch #16752 is fully preservedceil_mode=0andceil_mode=1